In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️
paste and paste0
paste() can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = " ")
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
paste("To", "is going be the ", "we go to the store!", sep = "day ")
[1] "Today is going be the day we go to the store!"
# and paste0 can be even simpler see ?paste0
paste0("Visit",1:5)
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
paste(1:5)
[1] "1" "2" "3" "4" "5"
paste(1:5, collapse = " ")
[1] "1 2 3 4 5"
Useful String functions
- toupper(), tolower() - uppercase or lowercase your data
- str_trim() (in the stringr package) or trimws() in base R - trim leading/trailing whitespace
- stringr::str_squish() - trims and replaces double spaces
- nchar - get the number of characters in a string
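A quick demo of these helpers (a minimal sketch; all calls are base R or stringr as noted):

```r
library(stringr)

toupper("baltimore")              # "BALTIMORE"
tolower("BALTIMORE")              # "baltimore"
trimws("  some text  ")           # "some text" (base R)
str_trim("  some text  ")         # "some text" (stringr)
str_squish("too   many   spaces") # "too many spaces" - trims AND collapses internal runs of spaces
nchar("Baltimore")                # 9
```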
stringr package
Like dplyr, the stringr package has a consistent interface: the first argument is the string, just as the first argument is the data.frame in dplyr.

str_detect, str_subset, str_replace, and str_replace_all search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
- str_detect - returns TRUE if the pattern is found
- str_subset - returns only the strings where the pattern was detected, i.e. x[str_detect(x, pattern)]
- str_extract - returns only strings where the pattern was detected, but ONLY the pattern
- str_replace - replaces the pattern with replacement the first time
- str_replace_all - replaces the pattern with replacement as many times as matched

stringr modifiers (see ?modifiers):
- fixed - match everything exactly
- regex - the default - uses regular expressions; ignore_case is an option to not have to use tolower
- boundary - match boundaries between things (e.g. words, sentences, characters)

Very similar:
Base R
- substr(x, start, stop) - substrings from position start to position stop
- strsplit(x, split) - splits strings up - returns a list!

stringr
- str_sub(x, start, end) - substrings from position start to position end
- str_split(string, pattern) - splits strings up - returns a list!

In base R, strsplit splits a vector on a string into a list:
x = c("I really", "like writing", "R code programs")
(y = strsplit(x, split = " ")) # returns a list
[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"
(y2 = stringr::str_split(x, " ")) # returns a list
[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"
One example case is when you want to split on a period “.”. In regular expressions . means ANY character, so
str_split("I.like.strings", ".")
[[1]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]]
[1] "I"       "like"    "strings"
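Escaping the period with a backslash is an equivalent fix (a small sketch; both calls behave like the fixed() version above):

```r
library(stringr)

# \\. escapes the period, so it is matched literally
str_split("I.like.strings", "\\.")[[1]]

# base R equivalent, using fixed = TRUE instead of escaping
strsplit("I.like.strings", split = ".", fixed = TRUE)[[1]]
```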
purrr or apply* to extract from string lists
sapply(y, dplyr::first) # on the fly
[1] "I" "like" "R"
purrr::map_chr(y, nth, 2) # on the fly
[1] "really" "writing" "code"
sapply(y, dplyr::last) # on the fly
[1] "really" "writing" "programs"
We can use boundary in the case of str_split as well:
words = c("These are   some words.")
str_count(words, boundary("word"))
[1] 4
# split with space
str_split(words, " ")[[1]]
[1] "These" "are" "" "" "some" "words."
# split between word
str_split(words, boundary("word"))[[1]]
[1] "These" "are" "some" "words"
- . matches any single character
- * means the preceding character can repeat any number of times (even 0)
- ? makes the preceding thing optional
- ^ matches the start of the string; ^a - starts with "a"
- $ matches the end of the string; b$ - ends with "b"

x = c("i think we all rule for participating",
"i think i have been outed",
"i think this will be quite fun actually",
"it will be fun, i think")
str_detect(x, "^i think")
[1] TRUE TRUE TRUE FALSE
x = c("well they had something this morning",
"then had to catch a tram home in the morning",
"dog obedience school in the morning",
"this morning I'll go for a run")
str_detect(x, "morning$")
[1] TRUE TRUE TRUE FALSE
x = c("Name the worst thing about Bush!",
"I saw a green bush",
"BBQ and bushwalking at Molonglo Gorge",
"BUSH!!")
str_detect(x,"[Bb][Uu][Ss][Hh]")
[1] TRUE TRUE TRUE TRUE
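The character-class pattern above can also be written with the ignore_case modifier from stringr (a small sketch reusing the same vector):

```r
library(stringr)

x = c("Name the worst thing about Bush!",
      "I saw a green bush",
      "BBQ and bushwalking at Molonglo Gorge",
      "BUSH!!")
# regex() with ignore_case = TRUE saves writing [Bb][Uu][Ss][Hh]
str_detect(x, regex("bush", ignore_case = TRUE))
# [1] TRUE TRUE TRUE TRUE
```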
x = c("7th inning stretch",
"2nd half soon to begin. OSU did just win.",
"3am - cant sleep - too hot still.. :(",
"5ft 7 sent from heaven")
str_detect(x,"^[0-9][a-zA-Z]")
[1] TRUE TRUE TRUE TRUE
We want to match strings that do NOT end in a ? or a period. A negated character class handles this: inside [ ], the ^ means NOT.
x = c("are you there?",
"2nd half soon to begin. OSU did just win.",
"6 and 9",
"dont worry... we all die anyway!")
str_detect(x,"[^?.]$")
[1] FALSE FALSE TRUE TRUE
x = c("these are post 9-11 rules",
"NetBios: scanning ip 203.169.114.66",
"Front Door 9:11:46 AM",
"Sings: 0118999881999119725...3 !")
str_detect(x, "9.11")
[1] TRUE TRUE TRUE TRUE
x = c("Not a whole lot of hurricanes.",
"We do have floods nearly every day",
"hurricanes swirl in the other direction",
"coldfire is STRAIGHT!")
str_detect(x, "flood|earthquake|hurricane|coldfire")
[1] TRUE TRUE TRUE TRUE
x = c("206-555-1122","206-332","4545","test")
phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_detect(x,phone)
[1] TRUE FALSE FALSE FALSE
suppressMessages({
raw_salary_data = Sal = readr::read_csv(
"https://raw.githubusercontent.com/muschellij2/adv_data_sci_2023/main/example_data/Baltimore_City_Employee_Salaries_FY2014.csv",
progress = FALSE)
})
head(Sal)
# A tibble: 6 × 7
  Name               JobTitle     AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>              <chr>        <chr>    <chr>  <chr>    <chr>        <chr>
1 Aaron,Keontae E    AIDE BLUE C… W02200   Youth… 06/10/2… $11310.00    $873.63
2 Aaron,Patricia G   Facilities/… A03031   OED-E… 10/24/1… $53428.00    $52868.…
3 Aaron,Petra L      ASSISTANT S… A29005   State… 09/25/2… $68300.00    $67439.…
4 Abaineh,Yohannes T EPIDEMIOLOG… A65026   HLTH-… 07/23/2… $62000.00    $58654.…
5 Abbene,Anthony M   POLICE OFFI… A99416   Polic… 07/24/2… $43999.00    $39686.…
6 Abbey,Emmanuel     CONTRACT SE… A40001   M-R I… 05/01/2… $52000.00    $47019.…
stringr and dplyr
str_subset(Sal$Name, "Rawlings")
[1] "Rawlings,Kellye A"          "Rawlings,MarqWell D"
[3] "Rawlings,Paula M"           "Rawlings-Blake,Stephanie C"
Sal %>% filter(str_detect(Name, "Rawlings"))
# A tibble: 4 × 7
  Name                   JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>                  <chr>    <chr>    <chr>  <chr>    <chr>        <chr>
1 Rawlings,Kellye A      EMERGEN… A40302   M-R I… 01/06/2… $47980.00    $68426.…
2 Rawlings,MarqWell D    AIDE BL… W02384   Youth… 06/15/2… $11310.00    $507.50
3 Rawlings,Paula M       COMMUNI… A04015   R&P-R… 12/10/2… $19802.00    $8195.79
4 Rawlings-Blake,Stepha… MAYOR    A01001   Mayor… 12/07/1… $163365.00   $161219…
grep: grep, grepl, regexpr and gregexpr search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
grep(pattern, x, fixed=FALSE), where:
pattern = character string containing a regular expression to be matched in the given character vector.
x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
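For example (a small sketch reusing the 9.11 idea from earlier): with a regular expression the . matches any character, while fixed = TRUE matches the literal string:

```r
x = c("post 9-11 rules", "Front Door 9:11:46 AM", "9.11 memorial")
grepl("9.11", x)               # TRUE TRUE TRUE  (. matches -, :, or .)
grepl("9.11", x, fixed = TRUE) # FALSE FALSE TRUE (only the literal "9.11")
```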
Base R does not have these stringr functions. Here is a "translator" from the stringr functions to the base R functions:
- str_detect - similar to grepl (returns logical); grep(value = FALSE) is similar to which(str_detect())
- str_subset - similar to grep(value = TRUE) - returns the values of the matches
- str_replace - similar to sub - replaces one time
- str_replace_all - similar to gsub - replaces many times
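A side-by-side sketch of the translator on made-up strings (sub/str_replace change the first match; gsub/str_replace_all change every match):

```r
library(stringr)

x = c("cats eat cat food", "no match here")
sub("cat", "dog", x)             # "dogs eat cat food" "no match here"
gsub("cat", "dog", x)            # "dogs eat dog food" "no match here"
str_replace(x, "cat", "dog")     # same result as sub
str_replace_all(x, "cat", "dog") # same result as gsub
```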
Base R:
- argument order: (pattern, x)
- exact matching: fixed = TRUE

stringr:
- argument order: (string, pattern), aka (x, pattern)
- exact matching: fixed(pattern)

These are the indices where the pattern match occurs:
grep("Rawlings",Sal$Name)
[1] 13832 13833 13834 13835
which(grepl("Rawlings", Sal$Name))
[1] 13832 13833 13834 13835
which(str_detect(Sal$Name, "Rawlings"))
[1] 13832 13833 13834 13835
These return a logical vector indicating whether the pattern matched:
head(grepl("Rawlings",Sal$Name))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
grep("Rawlings", Sal$Name, value=TRUE)
[1] "Rawlings,Kellye A"          "Rawlings,MarqWell D"
[3] "Rawlings,Paula M"           "Rawlings-Blake,Stephanie C"
Sal[grep("Rawlings", Sal$Name), ]
# A tibble: 4 × 7
  Name                   JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>                  <chr>    <chr>    <chr>  <chr>    <chr>        <chr>
1 Rawlings,Kellye A      EMERGEN… A40302   M-R I… 01/06/2… $47980.00    $68426.…
2 Rawlings,MarqWell D    AIDE BL… W02384   Youth… 06/15/2… $11310.00    $507.50
3 Rawlings,Paula M       COMMUNI… A04015   R&P-R… 12/10/2… $19802.00    $8195.79
4 Rawlings-Blake,Stepha… MAYOR    A01001   Mayor… 12/07/1… $163365.00   $161219…
str_extract
str_extract extracts just the matched string:
ss = str_extract(Sal$Name, "Rawling")
head(ss)
[1] NA NA NA NA NA NA
ss[!is.na(ss)]
[1] "Rawling" "Rawling" "Rawling" "Rawling"
str_extract and str_extract_all
str_extract_all extracts all the matched strings - \\d searches for DIGITS/numbers:
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "0" "2" "6" "9" "4"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]]
[1] "0" "2" "2" "0" "0"

[[2]]
[1] "0" "3" "0" "3" "1"
head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
[1] "Payne El,Jackie"         "Payne Johnson,Nickole A"
[3] "Payne,Chanel"
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
[1] "Greene,Spencer C"    "Spencer,Charles A"   "Spencer,Christian O"
[4] "Spencer,Clarence W"  "Spencer,Michael C"
stringr
head(str_subset(Sal$Name, "^Payne.*"), 3)
[1] "Payne El,Jackie"         "Payne Johnson,Nickole A"
[3] "Payne,Chanel"
head(str_subset(Sal$Name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset(Sal$Name, "Spence.*C.*"))
[1] "Greene,Spencer C"    "Spencer,Charles A"   "Spencer,Christian O"
[4] "Spencer,Clarence W"  "Spencer,Michael C"
Let’s say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "character"
sort(c("1", "2", "10")) # does not sort numerically (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] "$11310.00" "$53428.00" "$68300.00" "$62000.00"
head(as.numeric(Sal$AnnualSalary), 4)
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA
R didn't like the $, so it turned them all to NA.
sub() and gsub() can do the replacing part in base R.
Now we can replace the $ with nothing (we use fixed=TRUE because in a regular expression $ means end of string):
Sal$AnnualSalary = as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal = Sal %>% arrange(desc(AnnualSalary))
Sal %>% select(Name, AnnualSalary, JobTitle)
# A tibble: 18,981 × 3
   Name                       AnnualSalary JobTitle
   <chr>                             <dbl> <chr>
 1 Bernstein,Gregg L                238772 STATE'S ATTORNEY
 2 Charles,Ronnie E                 200000 EXECUTIVE LEVEL III
 3 Batts,Anthony W                  193800 EXECUTIVE LEVEL III
 4 Black,Harry E                    190000 EXECUTIVE LEVEL III
 5 Swift,Michael                    187200 CONTRACT SERV SPEC II
 6 Parthemos,Kaliope                172000 EXECUTIVE LEVEL III
 7 Ford,Niles R                     165000 EXECUTIVE LEVEL III
 8 Rawlings-Blake,Stephanie C       163365 MAYOR
 9 Chow,Rudolph S                   163200 DIRECTOR PUBLIC WORKS
10 Nilson,George A                  163200 CITY SOLICITOR
# ℹ 18,971 more rows
stringr
We can do the same thing (with 2 piping operations!) in dplyr:
dplyr_sal = Sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal) # same values; only attributes differ
[1] "Attributes: < Names: 1 string mismatch >"
[2] "Attributes: < Length mismatch: comparison on first 2 components >"
[3] "Attributes: < Component \"class\": Lengths (4, 3) differ (string compare on first 3) >"
[4] "Attributes: < Component \"class\": 3 string mismatches >"
[5] "Attributes: < Component 2: target is externalptr, current is numeric >"
$ and , in Practice
readr::parse_* provides a number of useful helper functions for parsing columns:
head(readr::parse_number(raw_salary_data$AnnualSalary))
[1] 11310 53428 68300 62000 43999 52000
raw_salary_data %>%
mutate(across(matches("Salary|Pay"), readr::parse_number)) %>%
select(matches("Salary|Pay"))
# A tibble: 18,981 × 2
AnnualSalary GrossPay
<dbl> <dbl>
1 11310 874.
2 53428 52868.
3 68300 67439.
4 62000 58655.
5 43999 39687.
6 52000 47020.
7 62175 61452.
8 70918 87900.
9 42438 53668.
10 11310 NA
# ℹ 18,971 more rows
( ) do grouping in regex; you can also reference these groups (with backreferences like \\1) in the order in which they appear:
x = c("Are you here?", "I think that is him.", "why didn't they?")
str_replace(
x,
regex(".*(are|think|did).*([.?])", ignore_case = TRUE),
"The verb of the sentence was \\1 and the ending punctuation is \\2")
[1] "The verb of the sentence was Are and the ending punctuation is ?"
[2] "The verb of the sentence was think and the ending punctuation is ."
[3] "The verb of the sentence was did and the ending punctuation is ?"
Note, you can’t reference something that doesn’t exist (no 3rd group):
str_replace(x,
regex(".*(are|think|did).*([.?])", ignore_case = TRUE),
"verb: \\1, ending punctuation: \\2, and \\3")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), : Trying to access the index that is out of bounds. (U_INDEX_OUTOFBOUNDS_ERROR)
janeaustenr
Attached with row numbers (by book).
library(janeaustenr)
original_books = austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()
head(original_books)
# A tibble: 6 × 3
  text                    book                linenumber
  <chr>                   <fct>                    <int>
1 "SENSE AND SENSIBILITY" Sense & Sensibility          1
2 ""                      Sense & Sensibility          2
3 "by Jane Austen"        Sense & Sensibility          3
4 ""                      Sense & Sensibility          4
5 "(1811)"                Sense & Sensibility          5
6 ""                      Sense & Sensibility          6
http://joss.theoj.org/papers/89fd1099620268fe0342ffdcdf66776f
http://stat545-ubc.github.io/block022_regular-expression.html
unnest_tokens
library(tidytext)
txt = c("These are words", "so are these", "this is running on")
sentence = c(1, 2, 3)
dat = tibble(txt, sentence)
unnest_tokens(dat, tok, txt)
# A tibble: 10 × 2
sentence tok
<dbl> <chr>
1 1 these
2 1 are
3 1 words
4 2 so
5 2 are
6 2 these
7 3 this
8 3 is
9 3 running
10 3 on
“The process of segmenting running text into words and sentences.”
tidy_books = original_books %>%
  unnest_tokens(word, text)
head(tidy_books)
# A tibble: 6 × 3
  book                linenumber word
  <fct>                    <int> <chr>
1 Sense & Sensibility          1 sense
2 Sense & Sensibility          1 and
3 Sense & Sensibility          1 sensibility
4 Sense & Sensibility          3 by
5 Sense & Sensibility          3 jane
6 Sense & Sensibility          3 austen
tt = tidy_books %>%
  filter(book == "Sense & Sensibility") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:200L)
wordcloud::wordcloud(tt$word, tt$n)
tidy_books %>% group_by(word) %>% tally() %>% arrange(desc(n))
# A tibble: 14,520 × 2
   word      n
   <chr> <int>
 1 the   26351
 2 to    24044
 3 and   22515
 4 of    21178
 5 a     13408
 6 her   13055
 7 i     12006
 8 in    11217
 9 was   11204
10 it    10234
# ℹ 14,510 more rows
tidy_books %>% count(book, word) %>% arrange(desc(n)) %>% group_by(book) %>% slice(1L)
# A tibble: 6 × 3
# Groups:   book [6]
  book                word      n
  <fct>               <chr> <int>
1 Sense & Sensibility to     4116
2 Pride & Prejudice   the    4331
3 Mansfield Park      the    6206
4 Emma                to     5239
5 Northanger Abbey    the    3179
6 Persuasion          the    3329
head(stop_words)
# A tibble: 6 × 2
  word      lexicon
  <chr>     <chr>
1 a         SMART
2 a's       SMART
3 able      SMART
4 about     SMART
5 above     SMART
6 according SMART
(tidy_books = tidy_books %>% anti_join(stop_words, by = "word"))
# A tibble: 217,609 × 3
   book                linenumber word
   <fct>                    <int> <chr>
 1 Sense & Sensibility          1 sense
 2 Sense & Sensibility          1 sensibility
 3 Sense & Sensibility          3 jane
 4 Sense & Sensibility          3 austen
 5 Sense & Sensibility          5 1811
 6 Sense & Sensibility         10 chapter
 7 Sense & Sensibility         10 1
 8 Sense & Sensibility         13 family
 9 Sense & Sensibility         13 dashwood
10 Sense & Sensibility         13 settled
# ℹ 217,599 more rows
tidy_books %>% count(word) %>% arrange(desc(n))
# A tibble: 13,914 × 2
   word       n
   <chr>  <int>
 1 miss    1855
 2 time    1337
 3 fanny    862
 4 dear     822
 5 lady     817
 6 sir      806
 7 day      797
 8 emma     787
 9 sister   727
10 house    699
# ℹ 13,904 more rows
top_book_words = tidy_books %>%
  count(word, book) %>%
  arrange(desc(n)) %>%
  group_by(book)
(top_book_words %>% slice(1:2))
# A tibble: 12 × 3
# Groups:   book [6]
   word      book                    n
   <chr>     <fct>               <int>
 1 elinor    Sense & Sensibility   623
 2 marianne  Sense & Sensibility   492
 3 elizabeth Pride & Prejudice     597
 4 darcy     Pride & Prejudice     373
 5 fanny     Mansfield Park        816
 6 crawford  Mansfield Park        493
 7 emma      Emma                  786
 8 miss      Emma                  599
 9 catherine Northanger Abbey      428
10 miss      Northanger Abbey      206
11 anne      Persuasion            447
12 captain   Persuasion            303
tt = tidy_books %>%
  filter(book == "Sense & Sensibility") %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:200L)
wordcloud::wordcloud(tt$word, tt$n)
“I hate this stupid class. But I love the work”
“Oh yeah, I totally love doing coding sessions”
bing = tidytext::sentiments
head(bing)
# A tibble: 6 × 2
  word       sentiment
  <chr>      <chr>
1 2-faces    negative
2 abnormal   negative
3 abolish    negative
4 abominable negative
5 abominably negative
6 abominate  negative
(dupes = bing %>% janitor::get_dupes(word))
# A tibble: 6 × 3
  word        dupe_count sentiment
  <chr>            <int> <chr>
1 envious              2 positive
2 envious              2 negative
3 enviously            2 positive
4 enviously            2 negative
5 enviousness          2 positive
6 enviousness          2 negative
Let's remove the cases where these duplicated words were marked positive:
bing = bing %>% # remove positive envy!
  anti_join(dupes %>% filter(sentiment == "positive"))
Joining with `by = join_by(word, sentiment)`
anyDuplicated(bing$word) == 0
[1] TRUE
any(duplicated(x)) vs. anyDuplicated(x) == 0 speed
Functions useful for checking and speed (big data): anyNA and anyDuplicated.
microbenchmark::microbenchmark(
  anyDup = anyDuplicated(bing$word) == 0,
  any_dup = any(duplicated(bing$word)),
  anyNA = anyNA(bing$word),
  any_is_na = any(is.na(bing$word))
)
Unit: microseconds
expr min lq mean median uq max neval cld
anyDup 125.940 134.7985 153.51218 153.0815 162.1125 265.684 100 a
any_dup 136.939 148.0775 164.68555 164.6075 175.2005 205.303 100 b
anyNA 4.690 5.0505 5.76450 5.6890 6.0945 13.296 100 c
any_is_na 21.047 22.2050 23.78423 22.9115 25.2005 41.605 100 d
Miss - may be misclassified as negative sentiment: in Austen, "miss" is usually a title (e.g. "Miss Elizabeth"), not a negative word.
top_book_words %>% slice(1:2) %>% left_join(bing, by = join_by(word))
# A tibble: 12 × 4
# Groups:   book [6]
   word      book                    n sentiment
   <chr>     <fct>               <int> <chr>
 1 elinor    Sense & Sensibility   623 <NA>
 2 marianne  Sense & Sensibility   492 <NA>
 3 elizabeth Pride & Prejudice     597 <NA>
 4 darcy     Pride & Prejudice     373 <NA>
 5 fanny     Mansfield Park        816 <NA>
 6 crawford  Mansfield Park        493 <NA>
 7 emma      Emma                  786 <NA>
 8 miss      Emma                  599 negative
 9 catherine Northanger Abbey      428 <NA>
10 miss      Northanger Abbey      206 negative
11 anne      Persuasion            447 <NA>
12 captain   Persuasion            303 <NA>
top_book_words %>% inner_join(bing, by = join_by(word)) %>% slice(1:2)
# A tibble: 12 × 4
# Groups:   book [6]
   word     book                    n sentiment
   <chr>    <fct>               <int> <chr>
 1 miss     Sense & Sensibility   210 negative
 2 happy    Sense & Sensibility   100 positive
 3 miss     Pride & Prejudice     283 negative
 4 love     Pride & Prejudice      92 positive
 5 miss     Mansfield Park        432 negative
 6 love     Mansfield Park        124 positive
 7 miss     Emma                  599 negative
 8 poor     Emma                  136 negative
 9 miss     Northanger Abbey      206 negative
10 pleasure Northanger Abbey       48 positive
11 miss     Persuasion            125 negative
12 happy    Persuasion             64 positive
janeaustensentiment = tidy_books %>%
  inner_join(bing, by = join_by(word)) %>%
  count(book, page = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
head(janeaustensentiment)
# A tibble: 6 × 5
  book                 page negative positive sentiment
  <fct>               <dbl>    <dbl>    <dbl>     <dbl>
1 Sense & Sensibility     0       16       26        10
2 Sense & Sensibility     1       19       44        25
3 Sense & Sensibility     2       12       23        11
4 Sense & Sensibility     3       15       22         7
5 Sense & Sensibility     4       16       29        13
6 Sense & Sensibility     5       16       39        23
janeaustensentiment %>% group_by(book) %>% slice(1:3)
# A tibble: 18 × 5
# Groups:   book [6]
   book                 page negative positive sentiment
   <fct>               <dbl>    <dbl>    <dbl>     <dbl>
 1 Sense & Sensibility     0       16       26        10
 2 Sense & Sensibility     1       19       44        25
 3 Sense & Sensibility     2       12       23        11
 4 Pride & Prejudice       0        7       15         8
 5 Pride & Prejudice       1       20       14        -6
 6 Pride & Prejudice       2       15       14        -1
 7 Mansfield Park          0       29       29         0
 8 Mansfield Park          1       20       30        10
 9 Mansfield Park          2       27       23        -4
10 Emma                    0       31       35         4
11 Emma                    1       28       23        -5
12 Emma                    2       30       20       -10
13 Northanger Abbey        0       27       21        -6
14 Northanger Abbey        1       22       32        10
15 Northanger Abbey        2       25       34         9
16 Persuasion              0       20       28         8
17 Persuasion              1       21       28         7
18 Persuasion              2       24       22        -2
In reality, you’d probably do more tidying on words like this, or a more sophisticated NLP approach (or gsub("miss (elizabeth|elinor)", "\\1", x)).
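A tiny sketch of that gsub() idea on made-up strings (hypothetical data, not the actual book text): the backreference \\1 keeps the captured name while dropping the "miss" title.

```r
x = c("miss elizabeth entered", "i will miss you", "miss elinor smiled")
gsub("miss (elizabeth|elinor)", "\\1", x)
# [1] "elizabeth entered" "i will miss you"   "elinor smiled"
```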
janeaustensentiment_nomiss = tidy_books %>%
  filter(word != "miss") %>%
  inner_join(bing, by = join_by(word)) %>%
  count(book, page = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(janeaustensentiment, aes(page, sentiment, fill = book)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~book, ncol = 3, scales = "free_x")
janeaustensentiment %>%
group_by(book) %>%
mutate(sentiment = cumsum(sentiment)) %>%
ggplot(aes(page, sentiment, colour = book)) +
geom_line() + ylab("Cumulative Sentiment") + xlab("Pages(ish)")
janeaustensentiment %>%
group_by(book) %>%
mutate(sentiment = cumsum(sentiment),
page = page/max(page)) %>%
ggplot(aes(page, sentiment, colour = book)) +
geom_line() + ylab("Cumulative Sentiment") + xlab("% Pages(ish)")
(p = janeaustensentiment_nomiss %>%
group_by(book) %>%
mutate(sentiment = cumsum(sentiment),
page = page/max(page)) %>%
ggplot(aes(page, sentiment, colour = book)) + geom_line(linewidth = 1.25) +
ylab("Cumulative Sentiment") + xlab("% Pages(ish)"))
transparent_legend = theme(
legend.background = element_rect(fill = "transparent"),
legend.key = element_rect(fill = "transparent",
color = "transparent"))
p +
transparent_legend +
scale_color_brewer(type = "qual") +
scale_x_continuous(labels = scales::percent_format()) +
theme(legend.position = c(0.2, 0.7), text = element_text(size = 20)) +
guides(colour = guide_legend(title = "Book",
override.aes = list(linewidth = 2)))
Can use wordStem to reduce certain words to their primary stem (e.g. remove gerunds/tense):
library(SnowballC)
wordStem(c("running","fasted"))
[1] "run" "fast"
Stemming gets you the root of the word, but the stem may be odd:
tidy_books %>%
  mutate(stem = SnowballC::wordStem(word)) %>%
  filter(stem != word) %>%
  head(20)
# A tibble: 20 × 4
   book                linenumber word         stem
   <fct>                    <int> <chr>        <chr>
 1 Sense & Sensibility          1 sense        sens
 2 Sense & Sensibility          1 sensibility  sensibl
 3 Sense & Sensibility         13 family       famili
 4 Sense & Sensibility         13 settled      settl
 5 Sense & Sensibility         13 estate       estat
 6 Sense & Sensibility         14 residence    resid
 7 Sense & Sensibility         14 centre       centr
 8 Sense & Sensibility         15 property     properti
 9 Sense & Sensibility         15 generations  gener
10 Sense & Sensibility         15 lived        live
11 Sense & Sensibility         16 respectable  respect
12 Sense & Sensibility         16 engage       engag
13 Sense & Sensibility         17 surrounding  surround
14 Sense & Sensibility         17 acquaintance acquaint
15 Sense & Sensibility         17 estate       estat
16 Sense & Sensibility         17 single       singl
17 Sense & Sensibility         18 lived        live
18 Sense & Sensibility         18 advanced     advanc
19 Sense & Sensibility         18 age          ag
20 Sense & Sensibility         19 housekeeper  housekeep
Text Mining in R - tm
Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) - topicmodels
quanteda: Quantitative Analysis of Textual Data - http://quanteda.io/
You still need to know these because topicmodels takes tm objects.
tidytext has a number of cast_* functions
library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
<<DocumentTermMatrix (documents: 2246, terms: 10473)>>
Non-/sparse entries: 302031/23220327
Sparsity           : 99%
Maximal term length: 18
Weighting          : term frequency (tf)
class(AssociatedPress)
head(tidy(AssociatedPress)) # generics::tidy
[1] "DocumentTermMatrix" "simple_triplet_matrix"
# A tibble: 6 × 3
document term count
<int> <chr> <dbl>
1 1 adding 1
2 1 adult 2
3 1 ago 1
4 1 alcohol 1
5 1 allegedly 1
6 1 allen 1
comparison = tidy(AssociatedPress) %>%
group_by(term) %>%
summarise(AP = sum(count)) %>% # add up the counts
rename(word = term) %>%
inner_join(count(tidy_books, word, name = "Austen")) %>%
mutate(AP = AP / sum(AP),
Austen = Austen / sum(Austen),
diff = AP - Austen) %>%
arrange(desc(abs(diff)))
Joining with `by = join_by(word)`
head(comparison)
# A tibble: 6 × 4
  word              AP    Austen     diff
  <chr>          <dbl>     <dbl>    <dbl>
1 million    0.00679   0.0000213  0.00677
2 government 0.00615   0.0000497  0.00610
3 lady       0.000100  0.00580   -0.00570
4 sir        0.0000870 0.00572   -0.00563
5 time       0.00412   0.00948   -0.00536
6 sister     0.000191  0.00516   -0.00497
tidy_freq = tidy_books %>%
  dplyr::ungroup() %>%
  count(book, word, name = "count")
head(tidy_freq)
# A tibble: 6 × 3
  book                word  count
  <fct>               <chr> <int>
1 Sense & Sensibility 1         2
2 Sense & Sensibility 10        1
3 Sense & Sensibility 11        1
4 Sense & Sensibility 12        1
5 Sense & Sensibility 13        1
6 Sense & Sensibility 14        1
nonum removes any words that are all numeric (many ways of doing this):
nonum = tidy_freq %>% filter(is.na(as.numeric(word)))
Warning: There was 1 warning in `filter()`.
ℹ In argument: `is.na(as.numeric(word))`.
Caused by warning:
! NAs introduced by coercion
head(nonum)
# A tibble: 6 × 3
  book                word      count
  <fct>               <chr>     <int>
1 Sense & Sensibility 7000l         1
2 Sense & Sensibility abandoned     1
3 Sense & Sensibility abatement     1
4 Sense & Sensibility abbeyland     1
5 Sense & Sensibility abhor         1
6 Sense & Sensibility abhorred      2
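An alternative that avoids the coercion warning (a sketch on a toy tibble): a regular expression flags all-digit words directly. Note it treats strings like "7000l" or "1e5" slightly differently than as.numeric() would.

```r
library(dplyr)
library(stringr)

toy = tibble(word = c("1811", "chapter", "7000l", "42"))
# keep only words that are NOT made up entirely of digits
toy %>% filter(!str_detect(word, "^[0-9]+$"))
# keeps "chapter" and "7000l"
```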
tidy_ap = tidy(AssociatedPress) %>%
rename(book = document,
word = term,
count = count)
dat = rbind(tidy_ap, tidy_freq)
head(dat)
# A tibble: 6 × 3
  book  word      count
  <chr> <chr>     <dbl>
1 1     adding        1
2 1     adult         2
3 1     ago           1
4 1     alcohol       1
5 1     allegedly     1
6 1     allen         1
Make a DocumentTermMatrix / reshape the data (a tm object):
dtm = dat %>% cast_dtm(document = book, term = word, value = count)
inspect(dtm[1:6, 1:10])
<<DocumentTermMatrix (documents: 6, terms: 10)>>
Non-/sparse entries: 15/45
Sparsity : 75%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs adding adult ago alcohol allegedly allen apparently appeared arrested
1 1 2 1 1 1 1 2 1 1
2 0 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 1 0
4 0 0 3 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 2 0 0 0 0 0 0
Terms
Docs assault
1 1
2 0
3 0
4 0
5 0
6 0
Here we normalize documents based on number of words:
dtm = as.matrix(dtm)
dtm = dtm / rowSums(dtm)
Show the similarity (based on count correlations with first document):
cor1 = cor(dtm[1, ], t(dtm))[1, ]
print(cor1[1:5])
         1          2          3          4          5 
1.00000000 0.01817799 0.18085240 0.04745425 0.03157564 
We see that the Austen books cluster together relative to the AP documents:
cor_ss = cor(dtm["Sense & Sensibility", ], t(dtm))[1, ]
print(cor_ss[1:5])
         1          2          3          4          5 
0.02056637 0.03157958 0.02608218 0.05342795 0.04504179 
The max similarity is not symmetrical (closest document/book to document 1 does not have document 1 as its closest document/book):
(index = which.max(cor1[-1]))
276 
275 
cor_ss = cor(dtm[index,],t(dtm))[1,] which.max(cor_ss[-index]) # not 1!
1126 
1125 
To see if we can separate the words into different topics, we can use a Latent Dirichlet Allocation (LDA) model.
Assumptions: every document is a mixture of topics, and every topic is a mixture of words.
The LDA estimates how much each word contributes to each topic (the beta matrix) and how much each topic contributes to each document (the gamma matrix).
For topicmodels::LDA, we need a DocumentTermMatrix (DTM) and can create one using tidytext::cast_dtm:
dtm = tidy_books %>%
  count(book, word) %>%
  tidytext::cast_dtm(document = book, term = word, value = n)
unique_indexes = unique(dtm$i) # get the index of each unique value
# let's try 6 topics
lda = topicmodels::LDA(dtm, k = 6L, control = list(seed = 20230910))
topics = tidy(lda, matrix = "beta")
head(topics)
# A tibble: 6 × 3
  topic term       beta
  <int> <chr>     <dbl>
1     1 1     3.72e- 44
2     2 1     2.68e-  5
3     3 1     5.51e-  5
4     4 1     1.21e-  4
5     5 1     3.21e-  5
6     6 1     1.26e-189
# get the top 12 terms for each topic
top_terms = topics %>%
  group_by(topic) %>%
  top_n(12, beta) %>% # get the top 12 beta by topic
  ungroup() %>%       # ungroup
  arrange(topic, -beta) # arrange words in descending informativeness
top_terms %>% # take the top terms
  mutate(term = reorder(term, beta)) %>% # sort terms by beta value
  ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by topic
  geom_col(show.legend = FALSE) + # as a bar plot
  facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip()
Here we just do 2 topics and show the top differentiated terms:
# let's try 2 topics
lda = topicmodels::LDA(dtm, k = 2L, control = list(seed = 20230910))
topics = tidy(lda, matrix = "beta")
beta_wide = topics %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
Topics contributing to Documents
We can extract the per-document-per-topic probabilities with the gamma param:
(docs = tidy(lda, matrix = "gamma") %>% arrange(topic, desc(gamma)))
# A tibble: 12 × 3
   document            topic gamma
   <chr>               <int> <dbl>
 1 Persuasion              1 0.576
 2 Emma                    1 0.505
 3 Northanger Abbey        1 0.494
 4 Sense & Sensibility     1 0.492
 5 Mansfield Park          1 0.449
 6 Pride & Prejudice       1 0.344
 7 Pride & Prejudice       2 0.656
 8 Mansfield Park          2 0.551
 9 Sense & Sensibility     2 0.508
10 Northanger Abbey        2 0.506
11 Emma                    2 0.495
12 Persuasion              2 0.424
What happens when we put the AP and Jane Austen data together?
dtm = dat %>% cast_dtm(document = book, term = word, value = count)
lda = topicmodels::LDA(dtm, k = 2L, control = list(seed = 20230910))
topics = tidy(lda, matrix = "beta")
docs = tidy(lda, matrix = "gamma") %>%
arrange(topic, desc(gamma))
docs %>% filter(grepl("[a-z]", tolower(document)))
# A tibble: 12 × 3
   document                topic      gamma
   <chr>                   <int>      <dbl>
 1 Northanger Abbey            1 0.00000256
 2 Persuasion                  1 0.00000239
 3 Sense & Sensibility         1 0.00000168
 4 Pride & Prejudice           1 0.00000164
 5 Emma                        1 0.00000130
 6 Mansfield Park              1 0.00000127
 7 Mansfield Park              2 1.00
 8 Emma                        2 1.00
 9 Pride & Prejudice           2 1.00
10 Sense & Sensibility         2 1.00
11 Persuasion                  2 1.00
12 Northanger Abbey            2 1.00
Latent Semantic Analysis (LSA), essentially SVD on the document matrix
library("quanteda.textmodels")
full_dfm = dat %>% cast_dfm(document = book, term = word, value = count)
lsa <- textmodel_lsa(full_dfm)
head(lsa$docs) # the number of dimensions to be included in output
[,1] [,2] [,3] [,4] [,5]
1 -0.0004466534 -4.906558e-05 -1.185645e-04 8.283042e-05 6.383163e-04
2 -0.0006519715 -5.232856e-05 -2.256336e-04 1.380608e-04 2.854672e-04
3 -0.0005093699 1.208376e-05 -2.145967e-04 -8.684338e-05 3.412429e-04
4 -0.0006160788 -3.357488e-04 -3.049020e-05 1.061577e-04 6.595658e-04
5 -0.0002627817 6.319020e-05 -8.837352e-05 7.178837e-06 8.245075e-05
6 -0.0009090253 -2.333275e-04 -4.887146e-04 4.987251e-06 2.923236e-04
[,6] [,7] [,8] [,9] [,10]
1 -0.0005237769 0.015878614 0.013804403 -0.0155122222 0.042554724
2 -0.0005547936 0.024556544 -0.002235408 -0.0184513067 -0.001411976
3 -0.0004557827 0.025442438 0.030374267 0.0110309458 0.061814879
4 -0.0004636071 0.019008392 0.001635891 -0.0062274731 -0.001769251
5 -0.0002890673 0.008736657 -0.001281150 0.0004967993 0.009625654
6 -0.0004085316 0.042249741 0.033391202 0.0156427399 -0.008949988
head(lsa$features)
[,1] [,2] [,3] [,4] [,5]
adding -2.995702e-03 -2.723500e-03 -2.274196e-04 3.548317e-03 2.270469e-03
adult -4.870432e-06 -9.564342e-07 -4.454224e-06 -1.799877e-06 1.204294e-05
ago -2.013992e-02 3.795791e-03 -1.977936e-03 7.645610e-03 3.519345e-03
alcohol -8.919886e-06 -2.544332e-06 -4.842384e-06 -1.997830e-06 2.106356e-05
allegedly -9.263670e-06 -2.582939e-06 -8.472069e-06 6.532244e-07 2.239337e-05
allen -9.212321e-03 -7.770502e-04 -9.495120e-03 8.370582e-03 1.548319e-02
[,6] [,7] [,8] [,9] [,10]
adding -2.987219e-03 0.005164597 4.108906e-04 -0.0039396233 0.004163654
adult -1.384768e-05 0.001714656 -1.285100e-03 0.0004245800 0.002373084
ago -1.557159e-03 0.037269853 -1.398956e-02 0.0004772952 0.017388251
alcohol -2.862765e-05 0.003366351 4.158749e-05 -0.0011563062 0.006074888
allegedly -2.667168e-05 0.004216523 2.339087e-03 -0.0060492367 0.007694172
allen -1.741246e-01 -0.007617308 -5.581021e-03 0.0012117878 -0.000107876